Objective

The objective is to build a regression model that predicts house price from several housing variables.

Data Processing

Area and Prices are quantitative variables measured in square feet and dollars respectively. Garage, FirePlace and Baths refer to the number of these items in a specific house. City is a qualitative variable indicating one of 3 different cities. All remaining variables indicate the presence (1) or absence (0) of that feature.

row  variable       mean      min    median   max    nunique  nmissing  eltype
     Symbol         Float64   Int64  Float64  Int64  Nothing  Nothing   DataType
1    :Area          124.93    1      125.0    249    nothing  nothing   Int64
2    :Garage        2.00129   1      2.0      3      nothing  nothing   Int64
3    :FirePlace     2.0034    0      2.0      4      nothing  nothing   Int64
4    :Baths         2.99807   1      3.0      5      nothing  nothing   Int64
5    :WhiteMarble   0.332992  0      0.0      1      nothing  nothing   Int64
6    :BlackMarble   0.33269   0      0.0      1      nothing  nothing   Int64
7    :IndianMarble  0.334318  0      0.0      1      nothing  nothing   Int64
8    :Floors        0.499386  0      0.0      1      nothing  nothing   Int64
9    :City          2.00094   1      2.0      3      nothing  nothing   Int64
10   :Solar         0.498694  0      0.0      1      nothing  nothing   Int64
11   :Electric      0.50065   0      1.0      1      nothing  nothing   Int64
12   :Fiber         0.500468  0      1.0      1      nothing  nothing   Int64
13   :GlassDoors    0.49987   0      0.0      1      nothing  nothing   Int64
14   :SwimingPool   0.500436  0      1.0      1      nothing  nothing   Int64
15   :Garden        0.501646  0      1.0      1      nothing  nothing   Int64
16   :Prices        42050.1   7725   41850.0  77975  nothing  nothing   Int64

Data Exploration

The data was explored to investigate three things:

  1. Distribution of house price

  2. How area affects the average house price

  3. Which housing variables contribute to house price

The distribution of house price is generally bell shaped, with a high number of house prices in the 35,000 to 40,000 range.

Average house prices generally increase with area range. However, that trend is broken in the 80-120 sqft range as well as the 180-200 sqft range.
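
The area-range averages described above can be sketched with a small grouping function. This is only an illustration: the sample values and the bin width are made up, and the function name `mean_price_by_area` is hypothetical rather than taken from the notebook.

```julia
# Toy data standing in for the Area and Prices columns (values are made up).
areas  = [15, 35, 62, 88, 110, 150, 190, 230]
prices = [12000, 18000, 30000, 29000, 41000, 52000, 50000, 70000]

# Mean price per area bin. `width` is the bin width in sqft (an assumption).
function mean_price_by_area(areas, prices; width = 20)
    sums = Dict{Int,Float64}()
    counts = Dict{Int,Int}()
    for (a, p) in zip(areas, prices)
        lo = width * div(a, width)            # lower edge of the bin containing a
        sums[lo] = get(sums, lo, 0.0) + p
        counts[lo] = get(counts, lo, 0) + 1
    end
    # One (lower edge, upper edge, mean price) tuple per bin, sorted by lower edge.
    return sort([(lo, lo + width, sums[lo] / counts[lo]) for lo in keys(sums)])
end

bins = mean_price_by_area(areas, prices; width = 100)
for (lo, hi, avg) in bins
    println("$lo-$hi sqft: mean price = $avg")
end
```

Plotting the third element of each tuple against the bin edges gives the kind of chart in which breaks in the upward trend become visible.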

Having white marble and a fiber connection tend to be strong predictors of price. Variables such as Garage, FirePlace, Baths and Floors, which increase with Area, also tend to predict higher prices. Features such as Solar, SwimingPool and Garden do not appear to predict price.

Model Building

The data was split into a training set and a testing set (70/30). The data science pipeline converts the variables into a continuous type, then fits an EvoTreeRegressor model to predict house prices, using a max_depth of 8.

Pipeline527(
    evo_tree_regressor = EvoTreeRegressor(
            loss = EvoTrees.Linear(),
            nrounds = 10,
            λ = 0.0f0,
            γ = 0.0f0,
            η = 0.1f0,
            max_depth = 8,
            min_weight = 1.0f0,
            rowsample = 1.0f0,
            colsample = 1.0f0,
            nbins = 64,
            α = 0.5f0,
            metric = :mse,
            seed = 444)) @883
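
The 70/30 split mentioned above can be sketched with the standard library alone. The helper name `split_indices` is hypothetical, and reusing the seed 444 here is only for illustration. The notebook may well have used a library routine for this step instead.

```julia
using Random

# Shuffle the row indices 1:n and cut them into a training and a testing part.
# `train_fraction` and the default seed are illustrative assumptions.
function split_indices(n; train_fraction = 0.7, rng = MersenneTwister(444))
    idx = shuffle(rng, collect(1:n))
    ntrain = round(Int, train_fraction * n)
    return idx[1:ntrain], idx[ntrain+1:end]
end

train, test = split_indices(10)
println("train rows: ", train)
println("test rows:  ", test)
```

Every row lands in exactly one of the two index sets, so the test set stays unseen during fitting.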

Cross Validation

The number of rounds of training (nrounds) was plotted on a learning curve. After about 120 rounds, further rounds give diminishing returns in rms error. As a result, the original model will be retrained with nrounds=128 rather than nrounds=10.
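
One way to make "diminishing returns" concrete is to pick the last point on the learning curve before the relative improvement falls below a tolerance. The sketch below is an assumption about how such a rule could look, not the procedure used in the notebook. The function name `pick_nrounds`, the tolerance and the curve values are all made up for illustration.

```julia
# Return the nrounds value after which the relative drop in rms error
# falls below `tol` (the tolerance is an illustrative assumption).
function pick_nrounds(nrounds, rms; tol = 0.01)
    for i in 2:length(rms)
        improvement = (rms[i-1] - rms[i]) / rms[i-1]
        improvement < tol && return nrounds[i-1]
    end
    return nrounds[end]   # the curve never flattened within the tested range
end

# Illustrative learning-curve points, not measured results.
rounds = [8, 16, 32, 64, 128, 256]
errors = [9000.0, 4000.0, 1200.0, 400.0, 230.0, 229.0]

chosen = pick_nrounds(rounds, errors)
println("chosen nrounds = ", chosen)
```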

A sufficient number of rounds is needed to achieve an rms error below 300, as indicated by the chart below.

Model Evaluation

Evaluation used the rms and mae metrics with a 70% shuffled resampling. A reasonable error was achieved for both metrics.

row  measure                  value
     Measure                  Float64
1    rms (callable Measure)   229.689
2    mae (callable Measure)   184.437

An rms error of 231.0 was also achieved on the test set.
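
For reference, the two metrics reported above can be written out directly. This is a minimal sketch with made-up predictions. The names `rms_error` and `mae_error` are hypothetical and chosen so they do not clash with the library measures used in the notebook.

```julia
# Root-mean-square error: square root of the mean squared residual.
rms_error(yhat, y) = sqrt(sum(abs2, yhat .- y) / length(y))

# Mean absolute error: mean of the absolute residuals.
mae_error(yhat, y) = sum(abs, yhat .- y) / length(y)

# Illustrative observed and predicted prices (not taken from the data set).
y    = [100.0, 200.0, 300.0, 400.0]
yhat = [110.0, 190.0, 330.0, 400.0]

println("rms = ", rms_error(yhat, y))
println("mae = ", mae_error(yhat, y))
```

Because rms squares the residuals, a few large misses raise it more than they raise mae, which is consistent with rms being the larger of the two values reported.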

Variable Importance

The chart below ranks variables in terms of their contribution to the model, under the Shapley framework. Interestingly, Floors, Fiber and WhiteMarble are the best predictors of house price.
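
To show what a Shapley contribution measures, the sketch below computes exact Shapley values for a tiny hypothetical model by enumerating every coalition of features. It is not how the notebook produced its chart: the function `shapley_values`, the three feature effects and the baseline are all assumptions, and exact enumeration is only practical for a handful of features.

```julia
# Exact Shapley values by enumerating all coalitions of the other features.
# `value(S)` returns the model output when only the features in S are present.
function shapley_values(value, nfeatures)
    phi = zeros(nfeatures)
    for i in 1:nfeatures
        others = [j for j in 1:nfeatures if j != i]
        for mask in 0:(2^length(others) - 1)
            # Decode the bit mask into a coalition S that excludes feature i.
            S = [others[k] for k in 1:length(others) if (mask >> (k - 1)) & 1 == 1]
            weight = factorial(length(S)) * factorial(nfeatures - length(S) - 1) /
                     factorial(nfeatures)
            # Weighted marginal contribution of adding feature i to S.
            phi[i] += weight * (value(vcat(S, i)) - value(S))
        end
    end
    return phi
end

# Hypothetical additive price model with three binary features.
effects  = [5000.0, 3000.0, -1000.0]   # made-up effects on price
baseline = 30000.0                      # made-up price with no features
value(S) = baseline + sum(effects[S]; init = 0.0)

phi = shapley_values(value, 3)
println("Shapley values: ", phi)
```

For an additive model like this one, each Shapley value equals the feature's own effect, and the values sum to the difference between the full prediction and the baseline.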

This chart shows how the important variables correlate with house price. For example, having Indian marble correlates negatively with house price, which is consistent with the Shapley variable importance plot. For the other variables, having a feature, or having more of it, correlates with a higher house price.
